Abstract

We analyze a few of the commonly used statistics based 
and machine learning algorithms for natural language 
disambiguation tasks and observe that they can be re- 
cast as learning linear separators in the feature space. 
Each of the methods makes a priori assumptions, which 
it employs, given the data, when searching for its hy- 
pothesis. Nevertheless, as we show, it searches a space 
that is as rich as the space of all linear separators. 
We use this to build an argument for a data driven 
approach which merely searches for a good linear sepa- 
rator in the feature space, without further assumptions 
on the domain or a specific problem. 

We present such an approach - a sparse network of 
linear separators, utilizing the Winnow learning algo- 
rithm - and show how to use it in a variety of ambiguity 
resolution problems. The learning approach presented
is attribute-efficient and, therefore, appropriate for
domains having a very large number of attributes.

In particular, we present an extensive experimental 
comparison of our approach with other methods on 
several well studied lexical disambiguation tasks such
as context-sensitive spelling correction, prepositional
phrase attachment and part of speech tagging. In all 
cases we show that our approach either outperforms 
other methods tried for these tasks or performs com- 
parably to the best. 



Introduction 

Many important natural language inferences can be 
viewed as problems of resolving ambiguity, either
semantic or syntactic, based on properties of the
surrounding context. Examples include part-of-speech
tagging, word-sense disambiguation, accent restoration, 
word choice selection in machine translation, context- 
sensitive spelling correction, word selection in speech 
recognition and identifying discourse markers. In each 
of these problems it is necessary to disambiguate two 
or more [semantically, syntactically or structurally]-
distinct forms which have been fused together into the
same representation in some medium. In a prototypical
instance of this problem, word sense disambiguation,
distinct semantic concepts such as interest rate and
has interest in Math are conflated in ordinary text.
The surrounding context - word associations and
syntactic patterns in this case - is sufficient to identify
the correct form.

Many of these are important stand-alone problems 
but even more important is their role in many applica- 
tions including speech recognition, machine translation, 
information extraction and intelligent human-machine 
interaction. Most of the ambiguity resolution problems 
are at the lower level of the natural language inferences 
chain; a wide range and a large number of ambigui- 
ties are to be resolved simultaneously in performing any 
higher level natural language inference. 

Developing learning techniques for language disam- 
biguation has been an active field in recent years and 
a number of statistics based and machine learning 
techniques have been proposed. A partial list
consists of Bayesian classifiers (Gale, Church, & Yarowsky
1993), decision lists (Yarowsky 1994), Bayesian hybrids
(Golding 1995), HMMs (Charniak 1993), inductive logic
methods (Zelle & Mooney 1996), memory-based methods
(Zavrel, Daelemans, & Veenstra 1997) and
transformation-based learning (Brill 1995). Most of these
have been developed in the context of a specific task,
although claims have been made as to their applicability
to others.

In this paper we cast the disambiguation problem as 
a learning problem and use tools from computational 
learning theory to gain some understanding of the as- 
sumptions and restrictions made by different learning 
methods in shaping their search space. 

The learning theory setting helps in making a few 
interesting observations. We observe that many algo- 
rithms, including naive Bayes, Brill's transformation 
based method, Decision Lists and the Back-off estima- 
tion method can be re-cast as learning linear separators 
in their feature space. As learning techniques for linear 
separators these techniques are limited in that, in gen- 
eral, they cannot learn all linearly separable functions. 
Nevertheless, we find, they still search a space that is as
complex, in terms of its VC dimension, as the space of
all linear separators. This has implications for the
generalization ability of their hypotheses. Together with
the fact that different methods seem to use different a 
priori assumptions in guiding their search for the linear 
separator, it raises the question of whether there is an 
alternative - search for the best linear separator in the 
feature space, without resorting to assumptions about 
the domain or any specific problem. 

Partly motivated by these insights, we present a new 
algorithm, and show how to use it in a variety of dis- 
ambiguation tasks. The architecture proposed, SNOW, 
is a Sparse Network Of linear separators which utilizes 
the Winnow learning algorithm. A target node in the 
network corresponds to a candidate in the
disambiguation task; all subnetworks learn autonomously from
the same data, in an on-line fashion, and at run time they
compete for assigning the correct meaning. The
architecture is data-driven (in that its nodes are allocated as
part of the learning process and depend on the observed
data) and supports efficient on-line learning. Moreover,
the learning approach presented is attribute-efficient
and, therefore, appropriate for domains having a very
large number of attributes. Altogether, we believe
that this approach has the potential to support, within 
a single architecture, a large number of simultaneously 
occurring and interacting language related tasks. 

To start validating these claims we present experi- 
mental results on three disambiguation tasks. Prepo- 
sitional phrase attachment (PPA) is the task of decid- 
ing whether the Prepositional Phrase (PP) attaches to 
the noun phrase (NP), as in Buy the car with the 
steering wheel or to the verb phrase (VP), as in Buy 
the car with his money. Context-sensitive Spelling 
correction (Spell) is the task of fixing spelling errors 
that result in valid words, such as It's not to late, 
where too was mistakenly typed as to. Part of speech 
tagging (POS) is the task of assigning each word in 
a given sentence the part of speech it assumes in this 
sentence. For example, assign N or V to talk in the
following pair of sentences: Have you listened to his
(him) talk? In all cases we show that our approach
either outperforms other methods tried for these tasks 
or performs comparably to the best. 

This paper focuses on analyzing the learning prob- 
lem and on motivating and developing the learning ap- 
proach; therefore we can only present the bottom line 
of the experimental studies and the details are deferred 
to companion reports. 

The Learning Problem 

Disambiguation tasks can be viewed as general classifi- 
cation problems. Given an input sentence we would like 
to assign it a single property out of a set of potential 
properties. 

Formally, given a sentence s and a predicate p
defined on the sentence, we let $C = \{c_1, c_2, \ldots, c_m\}$ be
the collection of possible values this predicate can
assume in s. It is assumed that one of the elements in
C is the correct assignment, c(s,p). $c_j$ can take values
from {site, cite, sight} if the predicate p is the correct
spelling of any occurrence of a word from this set in the 
sentence; it can take values from {v,n} if the pred- 
icate p is the attachment of the PP to the preceding 
VP (v) or the preceding NP (n), or it can take val- 
ues from {industrial, living organism} if the predicate 
is the meaning of the word plant in the sentence. In 
some cases, such as part of speech tagging, we may
apply a collection P of different predicates to the same
sentence, when tagging the first, second, kth word in
the sentence, respectively. Thus, we may perform a 
classification operation on the sentence multiple times. 
However, in the following definitions it would suffice to 
assume that there is a single pre-defined predicate op- 
erating on the sentence s; moreover, since the predicate 
studied will be clear from the context we omit it and 
denote the correct classification simply by c(s). 

A classifier h is a function that maps the set S of all
sentences^1, given the task defined by the predicate p,
to a single value in C, $h : S \to C$.

In the setting considered here the classifier h is
selected by a training procedure. That is, we assume^2 a
class of functions $\mathcal{H}$, and use the training data to select
a member of this class. Specifically, given a training
corpus $S_{tr}$ consisting of labeled examples (s, c(s)), a learning
algorithm selects a hypothesis $h \in \mathcal{H}$, the classifier.

The performance of the classifier is measured
empirically, as the fraction of correct classifications it performs
on a set $S_{ts}$ of test examples. Formally,

$$\mathrm{Perf}(h) = \frac{|\{s \in S_{ts} \mid h(s) = c(s)\}|}{|\{s \in S_{ts}\}|}. \qquad (1)$$
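As an illustration, Eq. 1 can be computed as in the following minimal Python sketch. The function name perf, the toy classifier toy_h and the three test sentences are hypothetical; they are not taken from the experiments reported in this paper.

```python
def perf(h, test_set):
    """Fraction of test examples (s, c(s)) on which h(s) equals c(s) (Eq. 1)."""
    examples = list(test_set)
    if not examples:
        raise ValueError("the test set must be non-empty")
    correct = sum(1 for s, label in examples if h(s) == label)
    return correct / len(examples)


def toy_h(sentence):
    """Hypothetical classifier for the confusion set {to, too}."""
    return "too" if "late" in sentence.split() else "to"


# Hypothetical labeled test examples (sentence, correct word).
toy_test = [
    ("it is not too late", "too"),
    ("go to school", "to"),
    ("me too", "too"),
]
accuracy = perf(toy_h, toy_test)
```

Here the toy classifier is right on the first two sentences and wrong on the third, so the measured performance is two thirds.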

A sentence s is represented as a collection of
features, and various kinds of feature representation can
be used. For example, typical features used in correcting
context-sensitive spelling are context words - which
test for the presence of a particular word within ±k
words of the target word, and collocations - which test
for a pattern of up to t contiguous words and/or
part-of-speech tags around the target word.
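The two feature types just described can be sketched in Python as follows. This is only one possible reading, under stated assumptions: collocations are built from words only (no part-of-speech tags), the target position is marked by the placeholder "_", and the function names and the "ctx:" / "col:" prefixes are hypothetical.

```python
def context_word_features(tokens, target_index, k):
    """Context words: every word found within +/-k positions of the target."""
    lo = max(0, target_index - k)
    hi = min(len(tokens), target_index + k + 1)
    return {
        "ctx:" + tokens[i].lower()
        for i in range(lo, hi)
        if i != target_index
    }


def collocation_features(tokens, target_index, max_len):
    """Collocations: contiguous word patterns of up to max_len words
    immediately around the target, with '_' standing for the target."""
    feats = set()
    n = len(tokens)
    for left in range(0, max_len + 1):
        for right in range(0, max_len - left + 1):
            if left + right == 0:
                continue  # a pattern needs at least one word
            if target_index - left < 0 or target_index + right >= n:
                continue  # pattern would run past the sentence boundary
            before = tokens[target_index - left:target_index]
            after = tokens[target_index + 1:target_index + 1 + right]
            feats.add("col:" + " ".join(before + ["_"] + after).lower())
    return feats


tokens = ["it", "is", "not", "to", "late"]
ctx = context_word_features(tokens, target_index=3, k=2)
col = collocation_features(tokens, target_index=3, max_len=2)
```

Each returned string names one active feature of the sentence; all other features are implicitly inactive.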

It is useful to consider features as sequences of tokens
(e.g., words in the sentence, or pos tags of the words).
In many applications (e.g., n-gram language models),
there is a clear ordering on the features. We define here
a natural partial order $\prec$ as follows: for features $f, g$
define $f \prec g \equiv f \subseteq g$, where on the right-hand side
features are viewed simply as sets of tokens^3. A feature
$f$ is of order k if it consists of k tokens.

A definition of a disambiguation problem consists of
the task predicate p, the set C of possible classifications
and the set $\mathcal{F}$ of features. $\mathcal{F}^{(k)}$ denotes the features of
order k. Let $|\mathcal{F}| = n$, and $x_i$ be the ith feature. $x_i$ can
either be present (active) in a sentence s (we then say
that $x_i = 1$), or absent from it ($x_i = 0$). Given that,
a sentence s can be represented as the set of all active
features in it, $s = (x_{i_1}, x_{i_2}, \ldots, x_{i_m})$.

1 The basic unit studied can be a paragraph or any other
unit, but for simplicity we will always call it a sentence.

2 This is usually not made explicit in statistical learning
procedures, but is assumed there too.

3 There are many ways to define features and order
relations among them (e.g., restricting the number of tokens
in a feature, enforcing sequential order among them, etc.).
The following discussion does not depend on the details; one
option is presented to make the discussion more concrete.

From the standpoint of the general framework the
exact mapping of a sentence to a feature set will not
matter, although it is crucially important in the
specific applications studied later in the paper. At this
point it is sufficient to notice that a sentence can be
mapped into a binary feature vector. Moreover, w.l.o.g.
we assume that $|C| = 2$; moving to the general case is
straightforward. From now on we will therefore treat
classifiers as Boolean functions, $h : \{0,1\}^n \to \{0,1\}$.

Approaches to Disambiguation 

Learning approaches are usually categorized as statisti- 
cal (or probabilistic) methods and symbolic methods. 
However, all learning methods are statistical in the 
sense that they attempt to make inductive generaliza- 
tion from observed data and use it to make inferences 
with respect to previously unseen data; as such, the
statistical based theories of learning (Vapnik 1995) apply
equally to both. The difference may be that symbolic 
methods do not explicitly use probabilities in the hy- 
pothesis. To stress the equivalence of the approaches 
further in the following discussion we will analyze two 
"statistical" and two "symbolic" approaches. 

In this section we present four widely used
disambiguation methods. Each method is first presented as
known and is then re-cast as a problem of learning a
linear separator. That is, we show that there is a linear
condition $\sum_{x_i \in \mathcal{F}} w_i x_i > \theta$ such that, given a sentence
$s = (x_{i_1}, x_{i_2}, \ldots, x_{i_m})$, the method predicts c = 1 if the
condition holds for it, and c = 0 otherwise.
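To make the linear condition concrete, the following minimal Python sketch evaluates it for a sentence given as its set of active features. Since inactive features have value 0, only the weights of active features enter the sum. The function name, the feature names and all numeric weights are hypothetical.

```python
def predict_linear(active_features, weights, theta):
    """Predict 1 if the summed weights of the active features exceed theta,
    and 0 otherwise. Features without a stored weight contribute 0."""
    total = sum(weights.get(f, 0.0) for f in active_features)
    return 1 if total > theta else 0


# Hypothetical weights over three features.
toy_weights = {"ctx:late": 1.2, "ctx:school": -0.8, "col:not _": 0.5}

pred_a = predict_linear({"ctx:late", "col:not _"}, toy_weights, theta=1.0)
pred_b = predict_linear({"ctx:school", "ctx:unseen"}, toy_weights, theta=1.0)
```

Each of the four methods below differs only in how it arrives at the weights and the threshold, not in the form of this test.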

Given an example $s = (x_1, x_2, \ldots, x_m)$ a
probabilistic classifier h works by choosing the
element of C that is most probable, that is
$h(s) = \arg\max_{c_i \in C} \Pr(c_i \mid x_1, x_2, \ldots, x_m)$^4, where the
probability is the empirical probability estimated from the
labeled training data. In general, it is unlikely that one
can estimate the probability of the event of interest
$(c_i \mid x_1, x_2, \ldots, x_m)$ directly from the training data. There
is a need to make some probabilistic assumptions in
order to evaluate the probability of this event
indirectly, as a function of "more frequent" events whose
probabilities can be estimated more robustly. Different
probabilistic assumptions give rise to different learning
methods and we describe two popular methods below.



The naive Bayes estimation (NB) The naive
Bayes estimation (e.g., (Duda & Hart 1973)) assumes
that given the class value $c \in C$ the feature values are
statistically independent. With this assumption and
using Bayes rule the Bayes optimal prediction is given
by:

$$h(s) = \arg\max_{c_i \in C} \prod_{j=1}^{m} \Pr(x_j \mid c_i) P(c_i).$$

The prior probabilities $P(c_i)$ (i.e., the fraction of
training examples labeled with $c_i$) and the conditional
probabilities $\Pr(x_j \mid c_i)$ (the fraction of the training
examples labeled $c_i$ in which the jth feature has value
$x_j$) can be estimated from the training data fairly
robustly^5, giving rise to the naive Bayes predictor.
According to it, the optimal decision is c = 1 when

$$\frac{P(c = 1) \prod_i P(x_i \mid c = 1)}{P(c = 0) \prod_i P(x_i \mid c = 0)} > 1.$$

4 As usual, we use the notation $\Pr(c_i \mid x_1, x_2, \ldots, x_m)$ as a
shortcut for $\Pr(c = c_i \mid x_1 = a_1, x_2 = a_2, \ldots, x_m = a_m)$.

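Denoting $p_i = P(x_i = 1 \mid c = 1)$, $q_i = P(x_i = 1 \mid c = 0)$,
$P(c = r) = P(r)$, we can write this condition as

$$\frac{P(1) \prod_i p_i^{x_i} (1 - p_i)^{1 - x_i}}{P(0) \prod_i q_i^{x_i} (1 - q_i)^{1 - x_i}} = \frac{P(1) \prod_i (1 - p_i) \left(\frac{p_i}{1 - p_i}\right)^{x_i}}{P(0) \prod_i (1 - q_i) \left(\frac{q_i}{1 - q_i}\right)^{x_i}} > 1,$$

and by taking log we get that using naive Bayes
estimation we predict c = 1 if and only if

$$\log \frac{P(1)}{P(0)} + \sum_i \log \frac{1 - p_i}{1 - q_i} + \sum_i \left(\log \frac{p_i}{1 - p_i} - \log \frac{q_i}{1 - q_i}\right) x_i > 0.$$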



We conclude that the decision surface of the naive Bayes 
algorithm is given by a linear function in the feature 
space. Points which reside on one side of the hyper- 
plane are more likely to be labeled 1 and points on the 
other side are more likely to be labeled 0. 
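The derivation can be checked numerically. The Python sketch below computes the bias and the weights of the log-form condition from a prior P(1) and per-feature probabilities p_i and q_i, and compares the resulting linear test with the product-form naive Bayes decision on every binary vector. All probability values and function names are hypothetical, and smoothing is ignored.

```python
import math
from itertools import product


def nb_linear_separator(prior1, p, q):
    """Bias and weights such that NB predicts 1 iff bias + sum_i w_i x_i > 0."""
    bias = math.log(prior1 / (1.0 - prior1))
    bias += sum(math.log((1.0 - pi) / (1.0 - qi)) for pi, qi in zip(p, q))
    weights = [
        math.log(pi / (1.0 - pi)) - math.log(qi / (1.0 - qi))
        for pi, qi in zip(p, q)
    ]
    return bias, weights


def nb_direct(prior1, p, q, x):
    """Naive Bayes decision taken directly from the ratio of products."""
    score1, score0 = prior1, 1.0 - prior1
    for pi, qi, xi in zip(p, q, x):
        score1 *= pi if xi else 1.0 - pi
        score0 *= qi if xi else 1.0 - qi
    return 1 if score1 / score0 > 1.0 else 0


def nb_linear(bias, weights, x):
    """The same decision taken through the linear form."""
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


# Hypothetical parameters for three binary features.
prior1 = 0.4
p = [0.8, 0.3, 0.6]  # p_i = P(x_i = 1 | c = 1)
q = [0.2, 0.5, 0.4]  # q_i = P(x_i = 1 | c = 0)

bias, weights = nb_linear_separator(prior1, p, q)
agree = all(
    nb_direct(prior1, p, q, x) == nb_linear(bias, weights, x)
    for x in product([0, 1], repeat=len(p))
)
```

A feature receives a positive weight exactly when it is more likely under c = 1 than under c = 0, which matches the sign of the log-odds difference in the last term of the condition.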

This representation immediately implies that this
predictor is optimal also in situations in which the
conditional independence assumption does not hold.
However, a more important consequence for our discussion
here is the fact that not all linearly separable functions
can be represented using this predictor (Roth 1998).

The back-off estimation (BO) Back-off estimation
is another method for estimating the conditional
probabilities $\Pr(c_i \mid s)$. It has been used in many
disambiguation tasks and in learning models for speech
recognition (Katz 1987; Chen & Goodman 1996; Collins &
Brooks 1995). The back-off method suggests estimating
$\Pr(c_i \mid x_1, x_2, \ldots, x_m)$ by interpolating the more
robust estimates that can be attained for the
conditional probabilities of more general events. Many
variations of the method exist; we describe a fairly general
one and then present the version used in (Collins &
Brooks 1995), which we compare with experimentally.



When applied to a disambiguation task, BO assumes
that the sentence itself (the basic unit processed) is a
feature^6 of maximal order $f = f^{(k)} \in \mathcal{F}$. We estimate

$$\Pr(c_i \mid s) = \Pr(c_i \mid f^{(k)}) = \sum_{\{f \in \mathcal{F} \mid f \prec f^{(k)}\}} \lambda_f \Pr(c_i \mid f).$$

The sum is over all features f which are more general
(and thus occur more frequently) than $f^{(k)}$. The
conditional probabilities on the right are empirical estimates
measured on the training data, and the coefficients $\lambda_f$
are also estimated given the training data. (Usually,
these are maximum likelihood estimates evaluated
using iterative methods, e.g. (Samuelsson 1996).)

5 Problems of sparse data may arise, though, when a
specific value of $x_i$ observed in testing has occurred infrequently
in the training, in conjunction with $c_j$. Various smoothing
techniques can be employed to get more robust estimations
but these considerations will not affect our discussion and
we disregard them.

Thus, given an example $s = (x_1, x_2, \ldots, x_m)$ the BO
method predicts c = 1 if and only if

$$\sum_{i=1}^{n} \lambda_i \left(\Pr(c = 1 \mid x_i) - \Pr(c = 0 \mid x_i)\right) x_i > 0,$$

a linear function over the feature space.
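The step from the interpolated estimate to this linear form can be sketched in Python for the two-class case, where Pr(c = 0 | f) = 1 - Pr(c = 1 | f). The interpolation coefficients, the conditional estimates and the function names below are hypothetical values chosen for illustration only.

```python
def bo_linear_weights(lambdas, pr1):
    """Weight of feature f: lambda_f * (Pr(c=1|f) - Pr(c=0|f)),
    using Pr(c=0|f) = 1 - Pr(c=1|f) for two classes."""
    return {f: lam * (2.0 * pr1[f] - 1.0) for f, lam in lambdas.items()}


def bo_predict_linear(weights, active):
    """Predict 1 iff the summed weights of the active features are positive."""
    return 1 if sum(weights.get(f, 0.0) for f in active) > 0 else 0


def bo_predict_direct(lambdas, pr1, active):
    """Predict 1 iff the interpolated estimate for c=1 exceeds that for c=0."""
    est1 = sum(lambdas[f] * pr1[f] for f in active if f in lambdas)
    est0 = sum(lambdas[f] * (1.0 - pr1[f]) for f in active if f in lambdas)
    return 1 if est1 > est0 else 0


# Hypothetical coefficients and empirical conditional estimates.
lambdas = {"a": 0.5, "b": 0.3, "c": 0.2}
pr1 = {"a": 0.9, "b": 0.2, "c": 0.5}

bo_weights = bo_linear_weights(lambdas, pr1)
```

A feature whose conditional estimate is exactly one half gets weight zero and therefore never influences the prediction.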

For computational reasons, various simplifying
assumptions are made in order to estimate the coefficients
$\lambda_f$; we describe here the method used in (Collins &
Brooks 1995)^7. We denote by $N(f^{(j)})$ the number of
occurrences of the jth order feature $f^{(j)}$ in the training
data. Then BO estimates $P = \Pr(c_i \mid f^{(k)})$ as follows:

If $N(f^{(k)}) > 0$, $P = \Pr(c_i \mid f^{(k)})$

Else if $\sum_{f \in \mathcal{F}^{(k-1)}} N(f) > 0$, $P = \frac{1}{|\mathcal{F}^{(k-1)}|} \sum_{f \in \mathcal{F}^{(k-1)}} \Pr(c_i \mid f)$

Else if ...

Else if $\sum_{f \in \mathcal{F}^{(1)}} N(f) > 0$, $P = \frac{1}{|\mathcal{F}^{(1)}|} \sum_{f \in \mathcal{F}^{(1)}} \Pr(c_i \mid f)$

In this case, it is easy to write down the linear sep- 
arator defining the estimate in an explicit way. Notice 
that with this estimation, given a sentence s, only the 
highest order features active in it are considered. There- 
fore, one can define the weights of the jth order feature 
in an inductive way, making sure that it is larger than 
the sum of the weights of the smaller order features. 
Leaving out details, it is clear that we get a simple rep- 
resentation of a linear separator over the feature space, 
that coincides with the BO algorithm. 

It is important to notice that the assumptions made 
in the BO estimation method result in a linear decision 
surface that is, in general, different from the one derived 
in the NB method. 

Transformation Based Learning (TBL) Transformation
based learning (Brill 1995) is a machine
learning approach for rule learning. It has been
applied to a number of natural language disambiguation
tasks, often achieving state-of-the-art accuracy.



6 The assumption that the maximal order feature is the
sentence is made, for example, in (Collins & Brooks 1995).
In general, the method deals with multiple
features of the maximal order by assuming their conditional
independence, and superimposing the NB approach.

7 There, the empirical ratios are smoothed; experimentally,
however, this yields only a slight improvement, going
from 83.7% to 84.1%, so we present it here in the pure form.



The learning procedure is a mistake-driven algorithm 
that produces a set of rules. Irrespective of the learning 
procedure used to derive the TBL representation, we 
focus here on the final hypothesis used by TBL and how 
it is evaluated, given an input sentence, to produce a 
prediction. We assume, w.l.o.g., $|C| = 2$.

The hypothesis of TBL is an ordered list of
transformations. A transformation is a rule with an antecedent
t and a consequent^8 $c \in C$. The antecedent t is a
condition on the input sentence. For example, in Spell,
a condition might be word W occurs within ±k of
the target word. That is, applying the condition to
a sentence s defines a feature $t(s) \in \mathcal{F}$. Phrased
differently, the application of the condition to a given
sentence s checks whether the corresponding feature is
active in this sentence. The condition holds if and only
if the feature is active in the sentence.

An ordered list of transformations (the TBL hypoth- 
esis), is evaluated as follows: given a sentence s, an 
initial label $c \in C$ is assigned to it. Then, each rule is
applied, in order, to the sentence. If the feature defined 
by the condition of the rule applies, the current label is 
replaced by the label in the consequent. This process 
goes on until the last rule in the list is evaluated. The 
last label is the output of the hypothesis. 

In its most general setting, the TBL hypothesis is not
a classifier (Brill 1995). The reason is that the truth
value of the condition of the ith rule may change while
evaluating one of the preceding rules. However, in many
applications and, in particular, in Spell (Mangu & Brill
1997) and PPA (Brill & Resnik 1994) which we discuss
later, this is not the case. There, the conditions do not
depend on the labels, and therefore the output hypoth- 
esis of the TBL method can be viewed as a classifier. 
The following analysis applies only for this case. 

Using the terminology introduced above, let
$(x_{i_1}, c_{i_1}), (x_{i_2}, c_{i_2}), \ldots, (x_{i_k}, c_{i_k})$ be the ordered sequence
of rules defining the output hypothesis of TBL. (Notice
that it is quite possible, and happens often in practice,
for a feature to appear more than once in this sequence,
even with different consequents.) While the above
description calls for evaluating the hypothesis by
sequentially evaluating the conditions, it is easy to see that
the following simpler procedure is sufficient:

Search the ordered sequence in a reversed order. Let
$x_{i_j}$ be the first active feature in the list (i.e., the
largest j). Then the hypothesis predicts $c_{i_j}$.

Alternatively, the TBL hypothesis can be represented
as a (positive) 1-Decision-List (p1-DL) (Rivest 1987),
over the set $\mathcal{F}$ of features^9. Given the p1-DL
representation (Fig. 1), we can now represent the hypothesis as a
linear separator over the set $\mathcal{F}$ of features. For
simplicity, we now name the class labels {-1, +1} rather than
{0, 1}. Then, the hypothesis predicts c = 1 if and only
if $\sum_{j=1}^{k} 2^j \cdot c_{i_j} \cdot x_{i_j} > 0$. Clearly, with this representation
the active feature with the highest index dominates the
prediction, and the representations are equivalent^10.

If $x_{i_k}$ is active then predict $c_{i_k}$.
Else If $x_{i_{k-1}}$ is active then predict $c_{i_{k-1}}$.
Else ...
Else If $x_{i_1}$ is active then predict $c_{i_1}$.
Else Predict the initial value

Figure 1: TBL as a p1-Decision List

8 The consequent is sometimes described as a
transformation $c_i \to c_j$, with the semantics - if the current label is $c_i$,
relabel it $c_j$. When $|C| = 2$ it is equivalent to simply using
$c_j$ as the consequent.

9 Notice, the order of the features is reversed. Also,
multiple occurrences of features can be discarded, leaving only
the last rule in which this feature occurs. By "positive" we
mean that we never condition on the absence of a feature,
only on its presence.

Decision Lists (p1-DL) It is easy to see (details
omitted), that the above analysis applies to p1-DL, a
method used, for example, in (Yarowsky 1995). The
BO and p1-DL differ only in that they keep the rules
in reversed order, due to different evaluation methods.
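The equivalence between the three ways of evaluating a TBL hypothesis (sequential rule application, reverse search, and the linear form with weights that double at each rule) can be checked with a short Python sketch. The rule list and function names are hypothetical, and one assumption is added that the text does not state: the initial label is given the smallest weight, 2^0, so that it decides the prediction when no rule fires.

```python
from itertools import product


def tbl_sequential(rules, initial, active):
    """Apply each (feature, label) rule in order; a firing rule overwrites
    the current label. Conditions are assumed not to depend on labels."""
    label = initial
    for feature, consequent in rules:
        if feature in active:
            label = consequent
    return label


def tbl_reverse_search(rules, initial, active):
    """Scan the rules from last to first; the first active feature decides."""
    for feature, consequent in reversed(rules):
        if feature in active:
            return consequent
    return initial


def tbl_linear(rules, initial, active):
    """Linear form with labels in {-1, +1}: rule j (1-based) contributes
    2**j * c_j when its feature is active. Assumption: the initial label
    enters with weight 2**0."""
    total = initial
    for j, (feature, consequent) in enumerate(rules, start=1):
        if feature in active:
            total += (2 ** j) * consequent
    return 1 if total > 0 else -1


# Hypothetical rule list; feature "a" occurs twice with different consequents.
rules = [("a", +1), ("b", -1), ("a", -1), ("c", +1)]
features = ["a", "b", "c"]

all_agree = True
for bits in product([0, 1], repeat=len(features)):
    active = {f for f, bit in zip(features, bits) if bit}
    for initial in (-1, +1):
        outputs = {
            tbl_sequential(rules, initial, active),
            tbl_reverse_search(rules, initial, active),
            tbl_linear(rules, initial, active),
        }
        all_agree = all_agree and len(outputs) == 1
```

The doubling weights work because 2^j is larger than the sum of all smaller powers of two, so no combination of lower-ranked rules can overturn the highest-ranked active one.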

The Linear Separator Representation 

To summarize, we have shown: 

claim: All the methods discussed - NB, BO, TBL and
p1-DL - search for a decision surface which is a linear
function in the feature space.

This is not to say that these methods assume that
the data is linearly separable. Rather, all the methods
assume that the feature space is divided by a linear
condition (i.e., a function of the form $\sum_{x_i \in \mathcal{F}} w_i x_i > \theta$)
into two regions, with the property that in one of the
defined regions the more likely prediction is 0 and in
the other, the more likely prediction is 1.

As pointed out, it is also instructive to see that these 
methods yield different decision surfaces and that they 
cannot represent every linearly separable function. 

Theoretical Support for the Linear
Separator Framework

In this section we discuss the implications these
observations have from the learning theory point of view.
In order to do that we need to resort to some of
the basic ideas that justify inductive learning. Why
do we hope that a classifier learned from the training
corpus will perform well (on the test data)? Informally,
the basic theorem of learning theory (Valiant 1984;
Vapnik 1995) guarantees that, if the training data and
the test data are sampled from the same distribution^11,
good performance on the training corpus guarantees
good performance on the test corpus.

If one knows something about the model that
generates the data, then estimating this model may yield
good performance on future examples. However, in
the problems considered here, no reasonable model is
known, or is likely to exist. (The fact that the
assumptions discussed above disagree with each other, in
general, may be viewed as a support for this claim.)

In the absence of this knowledge a learning method
merely attempts to make correct predictions. Under
these conditions, it can be shown that the error of
a classifier selected from class $\mathcal{H}$ on (previously
unseen) test data is bounded by the sum of its
training error and a function that depends linearly on the
complexity of $\mathcal{H}$. This complexity is measured in
terms of a combinatorial parameter - the VC-dimension
of the class $\mathcal{H}$ (Vapnik 1982) - which measures the
richness of the function class. (See (Vapnik 1995;
Kearns & Vazirani 1992) for details.)

We have shown that all the methods considered here
look for a linear decision surface. However, they do
make further assumptions which seem to restrict the
function space they search in. To quantify this line of
argument we ask whether the assumptions made by the
different algorithms significantly reduce the complexity
of the hypothesis space. The following claims show that
this is not the case; the VC dimension of the function
classes considered by all methods is as large as that
of the full class of linear separators.

Fact 1: The VC dimension of the class of linear
separators over n variables is n + 1.

Fact 2: The VC dimension of the class of p1-DL over
n variables^12 is n + 1.

Fact 3: The VC dimension of the class of linear
separators derived by either NB or BO over n variables is
bounded below by n.

Fact 1 is well known; 2 and 3 can be derived directly
from the definition (Roth 1998).

The implication is that a method that merely
searches for the optimal linear decision surface given
the training data may, in general, outperform all these
methods also on the test data. This argument can be
made formal by appealing to a result of (Kearns &
Schapire 1994), which shows that even when there is
no perfect classifier, the optimal linear separator on a
polynomial size set of training examples is optimal (in
a precise sense) also on the test data.

The optimality criterion we seek is described in Eq.
1: a linear classifier that minimizes the number of
disagreements (the sum of the false positive and false
negative classifications). This task, however, is known to
be NP-hard (Hoffgen & Simon 1992), so we need to
resort to heuristics. In searching for good heuristics we
are guided by computational issues that are relevant to
the natural language domain. An essential property of
an algorithm is being feature-efficient. Consequently,
the approach described in the next section makes use of
the Winnow algorithm which is known to produce good
results when a linear separator exists, as well as under
certain more relaxed assumptions (Littlestone 1991).

10 In practice, there is no need to use this representation,
given the efficient way suggested above to evaluate the
classifier. In addition, very few of the features in $\mathcal{F}$ are active in
every example, yielding more efficient evaluation techniques
(e.g., (Valiant 1998)).

11 This is hard to define in the context of natural language;
typically, this is understood as texts of similar nature; see a
discussion of this issue in (Golding & Roth 1996).

12 In practice, when using p1-DL as the hypothesis class
(i.e., in TBL) an effort is made to discard many of the
features and by that reduce the complexity of the space;
however, this process, which is data driven and does not a-priori
restrict the function class, can be employed by other
methods as well (e.g., (Blum 1995)) and is therefore orthogonal
to these arguments.



The SNOW Approach 

The SNOW architecture is a network of threshold gates.
Nodes in the first layer of the network are allocated to
input features in a data-driven way, given the input
sentences. Target nodes (i.e., the elements $c \in C$) are
represented by nodes in the second layer. Links from
the first to the second layer have weights; each target
node is thus defined as a (linear) function of the lower
level nodes. (A similar architecture which consists of an
additional layer is described in (Golding & Roth 1996).
Here we do not use the "cloud" level described there.)

For example, in Spell, target nodes represent mem- 
bers of the confusion sets; in POS, target nodes corre- 
spond to different pos tags. Each target node can be 
thought of as an autonomous network, although they 
all feed from the same input. The network is sparse in 
that a target node need not be connected to all nodes 
in the input layer. For example, it is not connected to 
input nodes (features) that were never active with it in 
the same sentence, or it may decide, during training to 
disconnect itself from some of the irrelevant inputs. 

Learning in SNOW proceeds in an on-line fashion^13.
Every example is treated autonomously by each
target subnetwork. Every labeled example is treated as



positive for the target node corresponding to its label 
and as negative to all others. Thus, every example is 
used once by all the nodes to refine their definition in 
terms of the others and is then discarded. At prediction 
time, given an input sentence which activates a subset 
of the input nodes, the information propagates through 
all the subnetworks; the one which produces the highest 
activity gets to determine the prediction. 

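A local learning algorithm, Winnow (Littlestone
1988), is used at each target node to learn its
dependence on other nodes. Winnow is a mistake driven
on-line algorithm, which updates its weights in a
multiplicative fashion. Its key feature is that the
number of examples it requires to learn the target function
grows linearly with the number of relevant attributes
and only logarithmically with the total number of
attributes. Winnow was shown to learn efficiently any
linear threshold function and to be robust in the
presence of various kinds of noise, and in cases where no
linear-threshold function can make perfect
classifications, and still maintain its abovementioned dependence
on the number of total and relevant attributes
(Littlestone 1991; Kivinen & Warmuth 1995).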


Notice that even when there are only two target nodes
and the cloud size (Golding & Roth 1996) is 1, SNOW
behaves differently than pure Winnow. While each of
the target nodes is learned using a positive Winnow
algorithm, a winner-take-all policy is used to determine
the prediction. Thus, we do not use the learning
algorithm here simply as a discriminator. One reason is
that the SNOW architecture, influenced by the
Neuroidal system (Valiant 1994), is being used in a system
developed for the purpose of learning knowledge rep- 
resentations for natural language understanding tasks, 
and is being evaluated on a variety of tasks for which 
the node allocation process is of importance. 
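The ingredients described in this section (data-driven allocation of links, a mistake-driven multiplicative update at each target node, one-vs-rest use of every labeled example, and a winner-take-all prediction) can be sketched in Python as follows. This is a minimal illustration and not the system evaluated in the paper: the class names, the threshold, the promotion and demotion factors, the initial weight and the toy examples are all hypothetical choices.

```python
class WinnowNode:
    """One target node: a positive Winnow learner over sparse features."""

    def __init__(self, theta=1.0, alpha=2.0, beta=0.5, init_weight=0.5):
        self.theta = theta              # activation threshold (assumed value)
        self.alpha = alpha              # promotion factor, > 1
        self.beta = beta                # demotion factor, between 0 and 1
        self.init_weight = init_weight  # weight given to a newly linked feature
        self.weights = {}               # links exist only for features seen with this target

    def activation(self, active):
        return sum(self.weights.get(f, 0.0) for f in active)

    def update(self, active, is_positive):
        if is_positive:
            # Allocate links the first time a feature co-occurs with the target.
            for f in active:
                self.weights.setdefault(f, self.init_weight)
        predicted = self.activation(active) > self.theta
        if is_positive and not predicted:
            for f in active:
                self.weights[f] *= self.alpha   # promote on a missed positive
        elif not is_positive and predicted:
            for f in active:
                if f in self.weights:
                    self.weights[f] *= self.beta  # demote on a false positive


class SnowSketch:
    """A set of target nodes that share the input and compete at prediction time."""

    def __init__(self, labels, **node_params):
        self.nodes = {label: WinnowNode(**node_params) for label in labels}

    def train_one(self, active, label):
        # Each example is positive for its own target and negative for the rest.
        for target, node in self.nodes.items():
            node.update(active, is_positive=(target == label))

    def predict(self, active):
        # Winner-take-all: the target node with the highest activation wins.
        return max(self.nodes, key=lambda t: self.nodes[t].activation(active))


# Hypothetical training examples for the confusion set {to, too}.
examples = [
    ({"ctx:late", "ctx:not"}, "too"),
    ({"ctx:go", "ctx:school"}, "to"),
    ({"ctx:late", "ctx:much"}, "too"),
    ({"ctx:walk", "ctx:school"}, "to"),
]
net = SnowSketch(["to", "too"])
for active, label in examples:
    net.train_one(active, label)  # each example is used once, then discarded
```

Because weights are stored only for features that have co-occurred with a target, each node stays sparse even when the overall feature space is very large.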

Experimental Evidence 

In this section we present experimental results for
three of the most well studied disambiguation
problems, Spell, PPA and POS. We present here only
the bottom-line results of an extensive study that
appears in companion reports (Golding & Roth 1998;
Krymolovsky & Roth 1998; Roth & Zelenko 1998).

Context Sensitive Spelling Correction Context- 
sensitive spelling correction is the task of fixing spelling 
errors that result in valid words, such as It's not to late, 
where too was mistakenly typed as to. 

We model the ambiguity among words by confusion 
sets. A confusion set C = {ci, . . . , c n } means that each 
word Ci in the set is ambiguous with each other word. 
All the results report ed here use the same pre-defined 

set of confusion sets ( Golding fe Roth 1996|). 

We compare SNOW against TBL ( [Mangu fc Brill 



1997) and a naive-Bayes based system (NB). The latter 



system presents a few augmentations over the simple 
naive Bayes (but still shares the same basic assump- 
tions) and is am ong the most s uccessful methods tried 
for the problem (Golding 1995). An indication that a 
Winnow-based algori thm performs well on this prob- 
lem was presented in (Golding & Roth 1996). However, 
the system presented there was more involved than 
SNOW and allows more expressive output represen- 
tation than we allow here. The output representation 
of all the approaches compared is a linear separator. 

The results presented in Table for NB and SNOW 
are the (weighted) average results of 21 confusion sets, 
19 of them are of size 2, and two of size 3. The results 
presented for the TBL0 method are taken from ( Mangu 



gro ^s linearly with the number of relevant attributes fc Brill 1997] ) and represent an average on a subset of 



and only logarithmically with the total number of at- 
tributes. Winnow was shown to learn efficiently any 
linear threshold function and to be robust in the pres- 
ence of various kinds of noise, and in cases where no 
linear-threshold function can make perfect classifica- 
tions and still maintain its abovementioned dependence 



14 of these, all of size 2. 
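The Baseline reported for Spell (guess the most common member of the confusion set seen in training) and the averaging over confusion sets can be sketched as below. This is an illustration under stated assumptions: the paper says only that the averages are "(weighted)", so weighting each confusion set by its number of test cases is our assumption, and the function names are hypothetical.

```python
from collections import Counter


def baseline_word(training_labels):
    """Most common member of a confusion set in the training data."""
    return Counter(training_labels).most_common(1)[0][0]


def baseline_accuracy(training_labels, test_labels):
    """Accuracy of always guessing the most common training label."""
    guess = baseline_word(training_labels)
    return sum(label == guess for label in test_labels) / len(test_labels)


def weighted_average(accuracies, test_counts):
    """Average of per-set accuracies, weighted by test cases per set (assumed)."""
    total = sum(test_counts)
    return sum(a * n for a, n in zip(accuracies, test_counts)) / total
```

With this weighting, a confusion set with many test cases influences the reported figure more than a rare one, which is one plausible reading of the weighted averages in the table.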

Table 1: Spell System comparison. The second
column gives the number of test cases. All algorithms
were trained on 80% of Brown and tested on the other
20%; Baseline simply identifies the most common
member of the confusion set during training, and guesses
it every time during testing.

Sets   Cases   Baseline   NB     TBL    SNOW
14     1503    71.1       89.9   88.5   93.5
21     4336    74.8       93.8   -      96.4

Systems are compared on the same feature set. TBL
was also used with an enhanced feature set (Mangu &
Brill 1997) with improved results of 93.3% but we have
not run the other systems with this set of features.

13 Although for the purpose of the experimental study we
do not update the network while testing.

Prepositional Phrase Attachment The problem
is to decide whether the Prepositional Phrase (PP)
attaches to the noun phrase, as in Buy the car with
the steering wheel, or to the verb phrase, as in Buy the
car with his money. Earlier works on this problem
(Ratnaparkhi, Reynar, & Roukos 1994; Brill & Resnik
1994; Collins & Brooks 1995) consider as input the
four head words involved in the attachment - the VP
head, the first NP head, the preposition and the second
NP head (in this case, buy, car, with and steering
wheel, respectively). These four-tuples, along with the
attachment decision, constitute the labeled input
sentence and are used to generate the feature set. The
features recorded are all sub-sequences of the 4-tuple,
a total of 15 for every input sentence. The data set used
by all the systems in this comparison was extracted
from the Penn Treebank WSJ corpus by (Ratnaparkhi,
Reynar, & Roukos 1994). It consists of 20801
training examples and 3097 separate test examples. In
a companion paper we describe an extensive set of
experiments with this and other data sets, under various
conditions. Here we present only the bottom line results
that provide direct comparison with those available in
the literature. The results presented in Table 2 for
NB and SNOW are the results of our system on the
3097 test examples. The results presented for TBL
and BO are on the same data set, taken from (Collins
& Brooks 1995).

Table 2: PPA System comparison. All algorithms
were trained on 20801 training examples from the WSJ
corpus and tested on 3097 previously unseen examples
from this corpus; all the systems use the same feature
set.

Test cases   Baseline   NB     TBL    BO     SNOW
3097         59.0       83.0   81.9   84.1   83.9
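The count of 15 features per PPA example follows from taking every non-empty sub-sequence of the 4-tuple (2^4 - 1 = 15). A minimal sketch is given below. Tagging each word with its slot name is our own encoding choice, and SLOTS and ppa_features are hypothetical names, not taken from the paper.

```python
from itertools import combinations

# Slot names for the four head words (assumed labels for illustration):
# VP head, first NP head, preposition, second NP head.
SLOTS = ("v", "n1", "p", "n2")


def ppa_features(four_tuple):
    """Return all non-empty sub-sequences of a 4-tuple of head words."""
    features = []
    for size in range(1, len(four_tuple) + 1):
        for positions in combinations(range(len(four_tuple)), size):
            # Keep the slot with each word so that sub-sequences stay distinct.
            features.append(tuple((SLOTS[i], four_tuple[i]) for i in positions))
    return features
```

For the example in the text, ppa_features(("buy", "car", "with", "steering wheel")) yields 4 single words, 6 pairs, 4 triples and the full 4-tuple.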



For PPA, SNOW was also evaluated with an enhanced
feature set (Krymolovsky & Roth 1998) with improved
results of 84.8%. (Collins & Brooks 1995) reports
results of 84.4% on a different enhanced set of features,
but other systems were not evaluated on these sets.

Part of Speech Tagging A part of speech tagger
assigns each word in a sentence the part of speech that
it assumes in that sentence. See (Brill 1995) for a
survey of much of the work that has been done on POS in
the past few years. Typically, in English there will be
between 30 and 150 different parts of speech depending
on the tagging scheme. In the study presented here,
following (Brill 1995) and many other studies, there are
47 different tags. Part-of-speech tagging suggests a
special challenge to our approach, as the problem is a
multi-class prediction problem (Roth & Zelenko 1998).
In the SNOW architecture, we devote one linear
separator to each pos tag and each subnetwork learns to
separate its corresponding pos tag from all others. At
run time, all class nodes process the given sentence,
applying many classifiers simultaneously. The
classifiers then compete for deciding the pos of this
word, and the node that records the highest activity for
a given word in a sentence determines its pos. The
methods compared use context and collocation features
as in (Brill 1995).

Table 3: POS System comparison. The first
column gives the number of test cases. All algorithms
were trained on 550,000 words of the tagged WSJ
corpus. Baseline simply predicts according to the most
common pos tag for the word in the training corpus.

Test cases   Baseline   TBL    SNOW
250,000      94.4       96.9   96.8

Given a sentence, each word in the sentence is assigned
an initial tag, based on the most common part of speech
for that word in the training corpus. Then, for each word in
the sentence, the network processes the sentence, and 
makes a suggestion for the pos of this word. Thus, the 
input for the predictor is noisy, since the initial assign- 
ment is not accurate for many of the words. This pro- 
cess can repeat a few times, where after predicting the 
pos of a word in the sentence we re-compute the new 
feature-based representation of the sentence and predict 
again. Each time the input to the predictors is expected 
to be slightly less noisy. In the results presented here,
however, we report the performance without the recycling
process, so that we maintain the linear function
expressivity (see (Roth & Zelenko 1998) for details).
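The tagging procedure just described can be sketched as follows. This is an illustration only: predict_tag is a hypothetical callback standing in for the trained network, the default tag for unseen words is an assumption, and re-computing the representation once per pass (rather than after every single word) is a simplification of the recycling process.

```python
def tag_sentence(words, most_common_tag, predict_tag, default_tag="NN", rounds=1):
    """Tag a sentence from noisy initial tags, optionally recycling predictions.

    most_common_tag: dict mapping a word to its most frequent training tag.
    predict_tag(words, tags, i): hypothetical predictor for the tag of word i,
        given the sentence and the current (possibly noisy) tags.
    rounds: 1 corresponds to no recycling; larger values re-predict from the
        previous pass's tags.
    """
    # Initial assignment: most common tag per word (default_tag is assumed
    # for words never seen in training).
    tags = [most_common_tag.get(w, default_tag) for w in words]
    for _ in range(rounds):
        # Every prediction in a pass sees the tags from the previous pass.
        tags = [predict_tag(words, tags, i) for i in range(len(words))]
    return tags
```

With rounds=1 each word is predicted exactly once from the initial assignment, which matches the setting reported here; higher values approximate the repeated passes mentioned in the text.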

The results presented in Table 3 are based on
experiments using 800,000 words of the Penn Treebank
Tagged WSJ corpus. About 550,000 words were used
for training and 250,000 for testing. SNOW and TBL
were trained and tested on the same data.

Conclusion 

We presented an analysis of a few of the commonly
used statistics-based and machine learning algorithms
for ambiguity resolution tasks. We showed that all the
algorithms investigated can be re-cast as learning
linear separators in the feature space. We analyzed the
complexity of the function space in which each of these
methods searches, and showed that they all search a
space that is as complex as the space of all linear
separators. We used these results to motivate our
approach of learning a sparse network of linear
separators (SNOW), which learns a network of linear
separators by utilizing the Winnow learning algorithm.
We then presented an extensive experimental study
comparing the SNOW-based algorithms to other methods
studied in the literature on several well studied
disambiguation tasks. We presented experimental
results on Spell, PPA and POS. In all cases we showed
that our approach either outperformed other methods
tried for these tasks or performed comparably to the
best. We view this as strong evidence that this approach
provides a unified framework for the study of natural
language disambiguation tasks.

The importance of providing a unified framework
stems from the fact that essentially all ambiguity
resolution problems that are addressed here are at the
lower level of the natural language inferences chain. A
large number of different kinds of ambiguities are to be
resolved simultaneously in performing any higher level
natural language inference (Cardie 1996). Naturally,
these processes, acting on the same input and using the
same "memory", will interact. A unified view of
ambiguity resolution within a single architecture is
valuable if one wants to understand how to put together
a large number of these inferences, study interactions
among them and make progress towards using these in
performing higher level inferences.